
    Core Lexicon and Contagious Words

    We present a new empirical parameter f_c, the most probable usage frequency of a word in a language, computed from the distribution of documents over the frequency x of that word. This parameter allows the core lexicon of a language to be filtered out from content words, which tend to be extremely frequent in some texts written in specific genres or by certain authors. For such words, the distribution of documents over frequency displays a long tail at x > f_c, representing the set of documents in which the word is used in abundance. Collections of such documents exhibit a percolation-like phase transition as the coarse grain of frequency Δf (which flattens out the strongly irregular frequency series) approaches the critical value f_c.
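    The estimator itself is not spelled out in this abstract, so the following is only a minimal sketch of the idea under simple assumptions: relative frequencies per document, a plain histogram, and the modal bin taken as f_c. The function name, bin count, and toy data are illustrative, not the paper's.

```python
# Minimal sketch (not the paper's exact estimator): estimate f_c for a word as the
# mode of its per-document relative-frequency distribution, then flag documents in
# the long tail (x > f_c) where the word is used in abundance.
from collections import Counter
import numpy as np

def most_probable_frequency(documents, word, bins=50):
    """documents: list of token lists; returns (f_c, per-document frequencies x)."""
    x = np.array([Counter(doc)[word] / len(doc) for doc in documents if doc])
    x = x[x > 0]                          # only documents that actually use the word
    counts, edges = np.histogram(x, bins=bins)
    i = counts.argmax()                   # modal bin = most probable usage frequency
    f_c = 0.5 * (edges[i] + edges[i + 1])
    return f_c, x

# toy usage: the share of documents beyond f_c hints at "contagious" content-word behaviour
docs = [["the", "cat", "sat"], ["cat", "cat", "cat", "food"], ["the", "dog", "ran"]]
f_c, x = most_probable_frequency(docs, "cat", bins=5)
print(f_c, (x > f_c).mean())              # fraction of documents in the long tail
```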

    Syntactic Knowledge via Graph Attention with BERT in Machine Translation

    Although the Transformer model can effectively acquire context features via its self-attention mechanism, deeper syntactic knowledge is still not effectively modeled. To alleviate this problem, we propose Syntactic knowledge via Graph attention with BERT (SGB) for Machine Translation (MT) scenarios. A Graph Attention Network (GAT) and BERT jointly represent syntactic dependency features as explicit knowledge of the source language, enriching source-language representations and guiding target-language generation. Our experiments use gold syntax-annotated sentences and a Quality Estimation (QE) model to interpret the translation-quality improvements attributable to syntactic knowledge, rather than relying on BLEU alone. Experiments show that the proposed SGB engines improve translation quality across the three MT tasks without sacrificing BLEU scores. We investigate which source-sentence lengths benefit the most and which dependencies are better identified by the SGB engines. We also find that GAT's learning of specific dependency relations is reflected in the quality of translations containing those relations, and that syntax on the graph leads to new modeling of syntactic aspects of source sentences in the middle and bottom layers of BERT.
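    The abstract names the mechanism (graph attention over source-side dependency structure on top of BERT token states) without architectural detail, so the sketch below is a rough, single-head illustration of that mechanism only; the class name, shapes, and the assumption that the adjacency matrix includes self-loops are mine, not the SGB paper's implementation.

```python
# Rough sketch of graph attention restricted to dependency-tree edges over BERT
# hidden states, so syntactic neighbours exchange information explicitly.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DependencyGATLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.attn = nn.Linear(2 * dim, 1)

    def forward(self, h, adj):
        # h: (seq_len, dim) BERT hidden states; adj: (seq_len, seq_len) 0/1 dependency
        # adjacency, assumed to include self-loops so every row has at least one edge
        z = self.proj(h)
        n = z.size(0)
        pairs = torch.cat([z.unsqueeze(1).expand(n, n, -1),
                           z.unsqueeze(0).expand(n, n, -1)], dim=-1)
        scores = F.leaky_relu(self.attn(pairs).squeeze(-1))
        scores = scores.masked_fill(adj == 0, float("-inf"))   # attend only along edges
        alpha = torch.softmax(scores, dim=-1)
        return F.elu(alpha @ z)                                 # syntax-aware token states

# toy usage: 4 tokens, one dependency edge plus self-loops
h = torch.randn(4, 768)                    # stand-in for BERT hidden states
adj = torch.eye(4)
adj[0, 1] = adj[1, 0] = 1.0
out = DependencyGATLayer(768)(h, adj)
```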

    GATology for Linguistics: What Syntactic Dependencies It Knows

    The Graph Attention Network (GAT) is a graph neural network, one strategy for modeling and representing explicit syntactic knowledge, and it can work with pre-trained models such as BERT in downstream tasks. Currently, there is still a lack of investigation into how GAT learns syntactic knowledge from the perspective of model structure. Moreover, as a strategy for modeling explicit syntactic knowledge, GAT and BERT have never been applied and discussed together in Machine Translation (MT) scenarios. We design a dependency relation prediction task to study how GAT learns the syntactic knowledge of three languages as a function of the number of attention heads and layers. We also use a paired t-test and F1 scores to clarify the differences in syntactic dependency prediction between GAT and BERT fine-tuned on the MT task (MT-B). The experiments show that better performance can be achieved by appropriately increasing the number of attention heads with two GAT layers; with more than two layers, learning suffers. Moreover, GAT is more competitive than MT-B in training speed and syntactic dependency prediction, which may indicate that it incorporates explicit syntactic knowledge more effectively and points to the possibility of combining GAT and BERT in MT tasks.
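    As a concrete reading of the evaluation described above, the sketch below computes per-relation F1 for two probes and compares them with a paired t-test over relation types. The toy labels, the relation set, and the per-relation micro-F1 scoring are assumptions made for illustration, not the paper's data or code.

```python
# Illustrative comparison of a GAT probe vs. an MT-fine-tuned BERT probe (MT-B)
# on dependency-relation prediction: per-relation F1, then a paired t-test.
from sklearn.metrics import f1_score
from scipy.stats import ttest_rel

def per_relation_f1(y_true, y_pred, relations):
    # micro-F1 restricted to a single relation label, one score per relation
    return [f1_score(y_true, y_pred, labels=[r], average="micro", zero_division=0)
            for r in relations]

relations = ["nsubj", "obj", "amod", "advmod"]
y_true   = ["nsubj", "obj", "amod", "obj", "advmod", "nsubj"]   # gold relations
gat_pred = ["nsubj", "obj", "amod", "obj", "nsubj", "nsubj"]    # toy GAT predictions
mtb_pred = ["nsubj", "amod", "amod", "obj", "advmod", "obj"]    # toy MT-B predictions

gat_f1 = per_relation_f1(y_true, gat_pred, relations)
mtb_f1 = per_relation_f1(y_true, mtb_pred, relations)
t, p = ttest_rel(gat_f1, mtb_f1)        # paired over the same relation types
print(gat_f1, mtb_f1, p)
```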

    Advanced corpus solutions for humanities researchers


    A robust statistical model of word frequencies

    Paper presented at the 5th Strathmore International Mathematics Conference (SIMC 2019), 12-16 August 2019, Strathmore University, Nairobi, Kenya. For the purposes of language teaching or automatic language processing it is important to know how frequent a word is. However, simply counting the number of times a word occurs in a collection of texts leads to many unfortunate artefacts, because some words occur far too often in a small number of texts, producing frequency bursts. In this paper we introduce a statistical model which uses methods from robust statistics to estimate the frequencies of words in a collection of texts. (University of Leeds, United Kingdom)
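    The paper's specific robust estimator is not given in the abstract; the sketch below only contrasts a naive pooled count with a generic robust alternative (the median of per-document relative frequencies) to show why bursts distort plain counts. The median is my stand-in, not necessarily the method proposed in the paper.

```python
# Why bursts break naive counts, and one generic robust fix: estimate a word's
# frequency from per-document relative frequencies with a robust location estimate.
from collections import Counter
import numpy as np

def naive_frequency(documents, word):
    total = sum(len(d) for d in documents)
    return sum(Counter(d)[word] for d in documents) / total

def robust_frequency(documents, word):
    per_doc = [Counter(d)[word] / len(d) for d in documents if d]
    return float(np.median(per_doc))      # a burst in one document barely moves the median

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["cat"] * 50 + ["food"]]  # burst in doc 3
print(naive_frequency(docs, "cat"))   # inflated by the bursty document
print(robust_frequency(docs, "cat"))  # closer to typical usage across documents
```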

    BERT goes off-topic: investigating the domain transfer challenge using genre classification

    While performance of many text classification tasks has been recently improved due to Pre-trained Language Models (PLMs), in this paper we show that they still suffer from a performance gap when the underlying distribution of topics changes. For example, a genre classifier trained on political topics often fails when tested on documents about sport or medicine. In this work, we quantify this phenomenon empirically with a large corpus and a large set of topics. Consequently, we verify that domain transfer remains challenging both for classic PLMs, such as BERT, and for modern large models, such as GPT-3. We also suggest and successfully test a possible remedy: after augmenting the training dataset with topically-controlled synthetic texts, the F1 score improves by up to 50% for some topics, nearing on-topic training results, while others show little to no improvement. While our empirical results focus on genre classification, our methodology is applicable to other classification tasks such as gender, authorship, or sentiment classification. The code and data to replicate the experiments are available at https://github.com/dminus1/genr.
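    To make the protocol concrete, here is a schematic version of the train-on-one-topic, test-on-another setup with a TF-IDF classifier standing in for BERT or GPT-3. All texts, labels, and the synthetic-augmentation step are toy stand-ins, not the paper's data or code; see the repository linked above for the actual experiments.

```python
# Schematic domain-transfer evaluation for genre classification: train on one topic,
# test on another, then add synthetic off-topic training texts and compare macro F1.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

def off_topic_f1(train_texts, train_genres, test_texts, test_genres):
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(train_texts, train_genres)
    return f1_score(test_genres, clf.predict(test_texts), average="macro")

# political-topic training set, sport-topic test set (toy genre labels: news vs. opinion)
train_texts  = ["parliament passed the budget bill", "why the new tax law is a mistake"]
train_genres = ["news", "opinion"]
test_texts   = ["the team won the championship final", "why the coach should be replaced"]
test_genres  = ["news", "opinion"]
# hypothetical topically-controlled synthetic sport-topic texts with known genre labels
synthetic_texts  = ["the club announced a transfer deal", "the referee ruined this match"]
synthetic_genres = ["news", "opinion"]

print(off_topic_f1(train_texts, train_genres, test_texts, test_genres))            # baseline
print(off_topic_f1(train_texts + synthetic_texts, train_genres + synthetic_genres,
                   test_texts, test_genres))                                        # augmented
```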